P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine preferences by data mining from physicochemical properties. In Decision Support Systems, Elsevier, 47(4):547-553. ISSN: 0167-9236.
In the above reference, two datasets were created, using red and white wine samples. The inputs include objective tests (e.g. pH values) and the output is based on sensory data (median of at least 3 evaluations made by wine experts). Each expert graded the wine quality between 0 (very bad) and 10 (very excellent). Several data mining methods were applied to model these datasets under a regression approach. The support vector machine model achieved the best results. Several metrics were computed: MAD, confusion matrix for a fixed error tolerance (T), etc. Also, we plot the relative importances of the input variables (as measured by a sensitivity analysis procedure).
Number of Attributes: 11 + output attribute
Note: several of the attributes may be correlated, thus it makes sense to apply some sort of feature selection.
Attribute information:
For more information, read [Cortez et al., 2009].
Input variables (based on physicochemical tests):
1. fixed acidity (tartaric acid - g / dm^3)
2. volatile acidity (acetic acid - g / dm^3)
3. citric acid (g / dm^3)
4. residual sugar (g / dm^3)
5. chlorides (sodium chloride - g / dm^3)
6. free sulfur dioxide (mg / dm^3)
7. total sulfur dioxide (mg / dm^3)
8. density (g / cm^3)
9. pH
10. sulphates (potassium sulphate - g / dm^3)
11. alcohol (% by volume)
Output variable (based on sensory data):
12. quality (score between 0 and 10)
We will analyze the dataset to investigate the features that make a good wine. In this problem we will use the White Wine dataset.
In this section, I will perform some preliminary exploration of the dataset.
Let’s start our analysis by summarizing the data and getting to know more about the dataset.
## [1] 4898 13
In our dataset there are 4898 rows and 13 features and the features are:
## [1] "X" "fixed.acidity" "volatile.acidity"
## [4] "citric.acid" "residual.sugar" "chlorides"
## [7] "free.sulfur.dioxide" "total.sulfur.dioxide" "density"
## [10] "pH" "sulphates" "alcohol"
## [13] "quality"
Let’s see the first rows of the dataset
## X fixed.acidity volatile.acidity citric.acid residual.sugar chlorides
## 1 1 7.0 0.27 0.36 20.7 0.045
## 2 2 6.3 0.30 0.34 1.6 0.049
## 3 3 8.1 0.28 0.40 6.9 0.050
## 4 4 7.2 0.23 0.32 8.5 0.058
## 5 5 7.2 0.23 0.32 8.5 0.058
## 6 6 8.1 0.28 0.40 6.9 0.050
## 7 7 6.2 0.32 0.16 7.0 0.045
## 8 8 7.0 0.27 0.36 20.7 0.045
## 9 9 6.3 0.30 0.34 1.6 0.049
## 10 10 8.1 0.22 0.43 1.5 0.044
## free.sulfur.dioxide total.sulfur.dioxide density pH sulphates alcohol
## 1 45 170 1.0010 3.00 0.45 8.8
## 2 14 132 0.9940 3.30 0.49 9.5
## 3 30 97 0.9951 3.26 0.44 10.1
## 4 47 186 0.9956 3.19 0.40 9.9
## 5 47 186 0.9956 3.19 0.40 9.9
## 6 30 97 0.9951 3.26 0.44 10.1
## 7 30 136 0.9949 3.18 0.47 9.6
## 8 45 170 1.0010 3.00 0.45 8.8
## 9 14 132 0.9940 3.30 0.49 9.5
## 10 28 129 0.9938 3.22 0.45 11.0
## quality
## 1 6
## 2 6
## 3 6
## 4 6
## 5 6
## 6 6
## 7 6
## 8 6
## 9 6
## 10 6
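The output above could be produced along these lines. This is a minimal sketch: the inline CSV text is a stand-in for the real file and holds only the first three rows printed above, restricted to a few columns. With the actual data one would pass the file path (not given in this report) to `read.csv()` instead.

```r
# Stand-in for the real CSV file: first three rows shown above, a few columns only.
csv_text <- "X,fixed.acidity,residual.sugar,alcohol,quality
1,7.0,20.7,8.8,6
2,6.3,1.6,9.5,6
3,8.1,6.9,10.1,6"

# With the real data this would be read.csv() on the path of the white wine file.
WhiteWines <- read.csv(text = csv_text)

dim(WhiteWines)            # number of rows and columns
names(WhiteWines)          # column names
head(WhiteWines, 10)       # first rows (this stand-in has only three)
sapply(WhiteWines, class)  # storage type of each column
```

`sapply(WhiteWines, class)` is one way to confirm which columns are stored as integers and which as doubles.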
All of our features are numeric, most of them stored as doubles, and X appears to be the id of the row.
Let’s summarize the data to learn more about the mean and percentiles of each feature.
## X fixed.acidity volatile.acidity citric.acid
## Min. : 1 Min. : 3.800 Min. :0.0800 Min. :0.0000
## 1st Qu.:1225 1st Qu.: 6.300 1st Qu.:0.2100 1st Qu.:0.2700
## Median :2450 Median : 6.800 Median :0.2600 Median :0.3200
## Mean :2450 Mean : 6.855 Mean :0.2782 Mean :0.3342
## 3rd Qu.:3674 3rd Qu.: 7.300 3rd Qu.:0.3200 3rd Qu.:0.3900
## Max. :4898 Max. :14.200 Max. :1.1000 Max. :1.6600
## residual.sugar chlorides free.sulfur.dioxide
## Min. : 0.600 Min. :0.00900 Min. : 2.00
## 1st Qu.: 1.700 1st Qu.:0.03600 1st Qu.: 23.00
## Median : 5.200 Median :0.04300 Median : 34.00
## Mean : 6.391 Mean :0.04577 Mean : 35.31
## 3rd Qu.: 9.900 3rd Qu.:0.05000 3rd Qu.: 46.00
## Max. :65.800 Max. :0.34600 Max. :289.00
## total.sulfur.dioxide density pH sulphates
## Min. : 9.0 Min. :0.9871 Min. :2.720 Min. :0.2200
## 1st Qu.:108.0 1st Qu.:0.9917 1st Qu.:3.090 1st Qu.:0.4100
## Median :134.0 Median :0.9937 Median :3.180 Median :0.4700
## Mean :138.4 Mean :0.9940 Mean :3.188 Mean :0.4898
## 3rd Qu.:167.0 3rd Qu.:0.9961 3rd Qu.:3.280 3rd Qu.:0.5500
## Max. :440.0 Max. :1.0390 Max. :3.820 Max. :1.0800
## alcohol quality
## Min. : 8.00 Min. :3.000
## 1st Qu.: 9.50 1st Qu.:5.000
## Median :10.40 Median :6.000
## Mean :10.51 Mean :5.878
## 3rd Qu.:11.40 3rd Qu.:6.000
## Max. :14.20 Max. :9.000
There are two categories of variables in this database:
We also noticed that the variables with higher variance are alcohol, free sulfur dioxide, total sulfur dioxide and residual sugar.
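One way to rank variables by variance is sketched below. The data frame holds only four rows and three columns copied from the first rows printed earlier, so the numbers are not those of the full dataset; the name `wines` is just for this illustration.

```r
# A few rows and columns copied from the head() output above (not the full data).
wines <- data.frame(
  residual.sugar = c(20.7, 1.6, 6.9, 8.5),
  chlorides      = c(0.045, 0.049, 0.050, 0.058),
  alcohol        = c(8.8, 9.5, 10.1, 9.9)
)

# Variance of every column, largest first.
variances <- sort(sapply(wines, var), decreasing = TRUE)
variances
```

Because variance depends on the units of each variable, such a ranking is only a rough guide when the variables are measured on different scales.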
We will start by analysing quality and then the variables with high variance.
Let’s see how the wines were ranked.
The wine quality distribution looks approximately normal: most wines were rated 6, followed by 5, the lowest grade given was 3 and the highest was 9.
I wonder what contributes to those grades. So let’s now look at the variables with higher variance and how they are distributed.
As quality takes a discrete and specific set of values, I will convert this feature to a factor. Doing so will help us in the visualization by allowing us to draw better boxplots and investigate the relationships between features.
WhiteWines$QualityCategory <- as.factor(WhiteWines$quality)
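A sketch of what this conversion does, on a made-up vector of grades (not the wine data). `table()` counts the wines per grade. Passing `ordered = TRUE` to `factor()` is an optional variation, not what this report uses, that keeps the natural ordering of the grades.

```r
# Made-up grades for illustration only.
quality <- c(6, 5, 6, 7, 5, 6, 8)

quality_factor <- as.factor(quality)   # same call as in the report
levels(quality_factor)                 # "5" "6" "7" "8"
table(quality_factor)                  # count per grade

# Optional variation: an ordered factor allows comparisons between grades.
quality_ordered <- factor(quality, ordered = TRUE)
quality_ordered[1] > quality_ordered[2]   # TRUE, because 6 > 5
```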
The percentage of alcohol content of the wine.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.00 9.50 10.40 10.51 11.40 14.20
We got a right-skewed histogram (the mean, 10.51, lies above the median, 10.40) with most wines around 10, the lowest at 8.00 and the highest at 14.20. I wonder if wines with more than 10% alcohol receive a higher quality ranking.
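The `stat_bin()` message printed with the histogram appears because no bin width was supplied. A minimal sketch of setting one explicitly, assuming ggplot2 is installed: the data are six alcohol values from the first rows printed earlier, and the bin width of 0.5 is an arbitrary choice for illustration.

```r
library(ggplot2)

# Alcohol values from the first rows printed earlier (not the full data).
wines <- data.frame(alcohol = c(8.8, 9.5, 10.1, 9.9, 9.6, 11.0))

# An explicit binwidth avoids the default of 30 bins and the stat_bin() message.
p <- ggplot(wines, aes(x = alcohol)) +
  geom_histogram(binwidth = 0.5, boundary = 8) +
  labs(x = "alcohol (% by volume)", y = "count")

summary(wines$alcohol)
print(p)
```

A bin width that matches the precision of the measurements usually gives a more readable histogram than the default.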
The amount of sugar remaining after fermentation stops; it’s rare to find wines with less than 1 gram/liter, and wines with greater than 45 grams/liter are considered sweet.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.600 1.700 5.200 6.391 9.900 65.800
sum(WhiteWines$residual.sugar < 1)
## [1] 77
sum(WhiteWines$residual.sugar > 45)
## [1] 1
We also got a right-skewed distribution, with most of the data concentrated below 20 g/L. The minimum is 0.60 g/L, the maximum is 65.80 g/L and the second-highest value is 31.60 g/L.
Counting the extremes, 77 wines have less than 1 g/L of residual sugar and only 1 wine has more than 45 g/L.
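The direction of the skew can be checked from the summary alone: with a long upper tail, the mean is pulled above the median. A small sketch using the summary values printed above; the comparison is a rule of thumb, not a formal test, and the variable names are only for this illustration.

```r
# Summary values for residual sugar, copied from the output above.
rs_min    <- 0.600
rs_median <- 5.200
rs_mean   <- 6.391
rs_max    <- 65.800

mean_above_median <- rs_mean > rs_median                          # TRUE
upper_tail_longer <- (rs_max - rs_median) > (rs_median - rs_min)  # TRUE

if (mean_above_median && upper_tail_longer) {
  message("Long tail towards high values: right (positive) skew")
}
```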
Then I transformed residual.sugar to log base 10.
WhiteWines$log10.residual.sugar <- log10(WhiteWines$residual.sugar)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.2218 0.2304 0.7160 0.6432 0.9956 1.8180
I transformed the long-tailed data to better understand the distribution of residual sugar. On the log10 scale, residual sugar appears bimodal, with peaks near 0.25 and 0.8.
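Because log10 is monotonic, it maps the quantiles of residual sugar onto the quantiles of the transformed variable (approximately, up to interpolation between neighbouring values; the mean is the exception). A quick check using the two summaries printed above:

```r
# Quantiles of residual.sugar from the first summary (g/dm^3).
raw <- c(Min = 0.600, Q1 = 1.700, Median = 5.200, Q3 = 9.900, Max = 65.800)

# Quantiles reported after the log10 transform.
reported <- c(Min = -0.2218, Q1 = 0.2304, Median = 0.7160, Q3 = 0.9956, Max = 1.8180)

round(log10(raw), 4)
max(abs(log10(raw) - reported))   # small: the differences come from rounding
```

Since 10^0.25 is about 1.8 and 10^0.8 is about 6.3, peaks read off the log10 histogram can be translated back to g/L in the same way.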
Amount of free and bound forms of SO2; in low concentrations, SO2 is mostly undetectable in wine, but at free SO2 concentrations over 50 ppm, SO2 becomes evident in the nose and taste of wine.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 9.0 108.0 134.0 138.4 167.0 440.0
It is approximately normal, with its peak near 130 ppm; most of the data lie between 0 and 250 ppm, with a minimum of 9 ppm and a maximum of 440 ppm.
## [1] 0.9899959
We also find that about 99.0% of the wines in the dataset have total SO2 above the level at which it becomes noticeable in the wine.
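The code that produced this proportion is not shown. One common way to compute such a share in R is to take the mean of a logical vector. The sketch below does this on a handful of total sulfur dioxide values taken from the summary and the first rows above; the 50 ppm cut-off is an assumption, borrowed from the description of free SO2.

```r
# A handful of total.sulfur.dioxide values seen above (extremes plus first rows).
total_so2 <- c(9, 170, 132, 97, 186, 136, 129, 440)

threshold <- 50                       # assumed cut-off, in ppm
share_above <- mean(total_so2 > threshold)
share_above                           # 7 of the 8 values exceed the cut-off

# Converting the reported proportion back to a count of wines.
round(0.9899959 * 4898)               # 4849
```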
The free form of SO2 exists in equilibrium between molecular SO2 (as a dissolved gas) and bisulfite ion; it prevents microbial growth and the oxidation of wine.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.00 23.00 34.00 35.31 46.00 289.00
It is also approximately normal, concentrated between 0 and 100, with its peak near 50, a minimum of 2 and a maximum of 289.
Let’s look at the other variables as well. I suspect that pH and density may also be correlated.
Describes how acidic or basic a wine is on a scale from 0 (very acidic) to 14 (very basic); most wines are between 3-4 on the pH scale
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.720 3.090 3.180 3.188 3.280 3.820
All wines are acidic (pH < 7), and pH is approximately normally distributed, with most values around 3.1, a minimum of 2.72 and a maximum of 3.82.
The density of wine is close to that of water, depending on the percent alcohol and sugar content.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9871 0.9917 0.9937 0.9940 0.9961 1.0390
It is approximately normal, with its peak near 0.9937, a minimum of 0.987 and a maximum near 1.04. Most values lie between 0.99 and 1.00, which means that the density of most wines is very close to that of water, about 1 g/cm^3.
The amount of salt in the wine. I wonder how this can affect quality.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00900 0.03600 0.04300 0.04577 0.05000 0.34600
It is also approximately normal, with its peak at 0.043; most values lie between 0.009 and 0.09, the minimum is 0.009 and the maximum is 0.346. I suspect that wines with less than 0.04 have the best quality.
A wine additive which can contribute to sulfur dioxide gas (SO2) levels, which acts as an antimicrobial and antioxidant.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.2200 0.4100 0.4700 0.4898 0.5500 1.0800
We got a bimodal distribution with peaks near 0.4 and 0.5. The minimum is 0.22 and the maximum is 1.08.
Let’s look at the acids. There are three types of acids:
1 - fixed acidity: most acids involved with wine are fixed or nonvolatile (do not evaporate readily)
2 - volatile acidity: the amount of acetic acid in wine, which at too high of levels can lead to an unpleasant, vinegar taste
3 - citric acid: found in small quantities, citric acid can add ‘freshness’ and flavor to wines
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.800 6.300 6.800 6.855 7.300 14.200
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0800 0.2100 0.2600 0.2782 0.3200 1.1000
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.2700 0.3200 0.3342 0.3900 1.6600
Comparison:
All three are approximately normal. Fixed acidity is much larger than the other two: its peak is near 7, whereas the others peak near 0.3, and it has a higher variance.
Let’s aggregate the acids and see what we get.
WhiteWines$acid <- WhiteWines$citric.acid + WhiteWines$fixed.acidity +
WhiteWines$volatile.acidity
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.130 6.890 7.405 7.467 7.960 14.960
Analyzing the new feature, we find that it closely resembles fixed acidity, with its peak at 7.5.
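That resemblance is expected, since fixed acidity is an order of magnitude larger than the other two acids and dominates the sum. A small sketch on the first three rows printed earlier; it uses `rowSums()` as an equivalent way to write the same aggregation, and the name `wines` is only for this illustration.

```r
# First three rows of the acid columns, copied from the head() output.
wines <- data.frame(
  fixed.acidity    = c(7.0, 6.3, 8.1),
  volatile.acidity = c(0.27, 0.30, 0.28),
  citric.acid      = c(0.36, 0.34, 0.40)
)

# Same aggregation as in the report, written with rowSums().
acid_cols <- c("fixed.acidity", "volatile.acidity", "citric.acid")
wines$acid <- rowSums(wines[, acid_cols])
wines$acid                                   # 7.63 6.94 8.78

# Share of the total contributed by fixed acidity.
round(wines$fixed.acidity / wines$acid, 2)   # 0.92 0.91 0.92
```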
In my dataset there are 4898 observations with 13 features (quality, residual.sugar, sulphates, fixed.acidity, among others).
From our observations:
The main features in the dataset are quality, alcohol and density. I’d like to determine which features are best for predicting the quality of a wine. I suspect that a combination of density, alcohol and other features contributes to it.
Residual sugar, chlorides, pH, sulphates and others may contribute to the quality of a wine.
I created three different features: one by converting the quality of the wine to a factor, another by transforming the residual sugar feature to log10, and a third by combining the three different acids (fixed, volatile and citric). I did so in order to better visualize the relations between the different features.
The residual sugar had a right-skewed plot, so I tried the log10 transformation and got a bimodal distribution.
We will now look at the correlation matrix to see which variables are correlated with quality and with each other.
## X fixed.acidity volatile.acidity citric.acid
## X 1.00 -0.26 0.00 -0.15
## fixed.acidity -0.26 1.00 -0.02 0.29
## volatile.acidity 0.00 -0.02 1.00 -0.15
## citric.acid -0.15 0.29 -0.15 1.00
## residual.sugar 0.01 0.09 0.06 0.09
## chlorides -0.05 0.02 0.07 0.11
## free.sulfur.dioxide -0.01 -0.05 -0.10 0.09
## total.sulfur.dioxide -0.16 0.09 0.09 0.12
## density -0.19 0.27 0.03 0.15
## pH -0.12 -0.43 -0.03 -0.16
## sulphates 0.01 -0.02 -0.04 0.06
## alcohol 0.21 -0.12 0.07 -0.08
## quality 0.04 -0.11 -0.19 -0.01
## log10.residual.sugar 0.02 0.07 0.09 0.06
## acid -0.26 0.99 0.07 0.39
## residual.sugar chlorides free.sulfur.dioxide
## X 0.01 -0.05 -0.01
## fixed.acidity 0.09 0.02 -0.05
## volatile.acidity 0.06 0.07 -0.10
## citric.acid 0.09 0.11 0.09
## residual.sugar 1.00 0.09 0.30
## chlorides 0.09 1.00 0.10
## free.sulfur.dioxide 0.30 0.10 1.00
## total.sulfur.dioxide 0.40 0.20 0.62
## density 0.84 0.26 0.29
## pH -0.19 -0.09 0.00
## sulphates -0.03 0.02 0.06
## alcohol -0.45 -0.36 -0.25
## quality -0.10 -0.21 0.01
## log10.residual.sugar 0.93 0.07 0.31
## acid 0.10 0.05 -0.05
## total.sulfur.dioxide density pH sulphates alcohol
## X -0.16 -0.19 -0.12 0.01 0.21
## fixed.acidity 0.09 0.27 -0.43 -0.02 -0.12
## volatile.acidity 0.09 0.03 -0.03 -0.04 0.07
## citric.acid 0.12 0.15 -0.16 0.06 -0.08
## residual.sugar 0.40 0.84 -0.19 -0.03 -0.45
## chlorides 0.20 0.26 -0.09 0.02 -0.36
## free.sulfur.dioxide 0.62 0.29 0.00 0.06 -0.25
## total.sulfur.dioxide 1.00 0.53 0.00 0.13 -0.45
## density 0.53 1.00 -0.09 0.07 -0.78
## pH 0.00 -0.09 1.00 0.16 0.12
## sulphates 0.13 0.07 0.16 1.00 -0.02
## alcohol -0.45 -0.78 0.12 -0.02 1.00
## quality -0.17 -0.31 0.10 0.05 0.44
## log10.residual.sugar 0.42 0.76 -0.18 -0.03 -0.39
## acid 0.11 0.28 -0.43 -0.01 -0.12
## quality log10.residual.sugar acid
## X 0.04 0.02 -0.26
## fixed.acidity -0.11 0.07 0.99
## volatile.acidity -0.19 0.09 0.07
## citric.acid -0.01 0.06 0.39
## residual.sugar -0.10 0.93 0.10
## chlorides -0.21 0.07 0.05
## free.sulfur.dioxide 0.01 0.31 -0.05
## total.sulfur.dioxide -0.17 0.42 0.11
## density -0.31 0.76 0.28
## pH 0.10 -0.18 -0.43
## sulphates 0.05 -0.03 -0.01
## alcohol 0.44 -0.39 -0.12
## quality 1.00 -0.06 -0.13
## log10.residual.sugar -0.06 1.00 0.09
## acid -0.13 0.09 1.00
Looking more carefully at the quality row, we find that quality is most strongly, and positively, correlated with alcohol (0.44). The features most negatively correlated with quality are density (-0.31) and chlorides (-0.21).
## density chlorides volatile.acidity
## -0.307123313 -0.209934411 -0.194722969
## total.sulfur.dioxide acid fixed.acidity
## -0.174737218 -0.131377207 -0.113662831
## residual.sugar log10.residual.sugar citric.acid
## -0.097576829 -0.064631762 -0.009209091
## free.sulfur.dioxide X sulphates
## 0.008158067 0.035763247 0.053677877
## pH alcohol quality
## 0.099427246 0.435574715 1.000000000
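A sketch of how such a sorted list of correlations can be obtained. The data frame here is synthetic (generated with a fixed seed purely for illustration), so its numbers are not those of the wine dataset; only the pattern of calls matters. The name `toy` and its columns are assumptions of this example.

```r
set.seed(1)  # synthetic data, for illustration only

n <- 200
alcohol <- runif(n, 8, 14)
toy <- data.frame(
  alcohol  = alcohol,
  density  = 1.02 - 0.0025 * alcohol + rnorm(n, sd = 0.002),
  quality  = round(3 + 0.3 * alcohol + rnorm(n)),
  category = factor(sample(c("a", "b"), n, replace = TRUE))  # non-numeric column
)

# cor() needs numeric input, so non-numeric columns are dropped first.
numeric_cols <- sapply(toy, is.numeric)
corr <- cor(toy[, numeric_cols])

round(corr, 2)             # full matrix, rounded to two decimals
sort(corr[, "quality"])    # correlations with quality, in ascending order
```

Sorting in ascending order puts the strongest negative correlations first and quality itself (correlation 1) last.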
Let’s see a visualization of this matrix.
##
## Attaching package: 'psych'
## The following objects are masked from 'package:scales':
##
## alpha, rescale
## The following objects are masked from 'package:ggplot2':
##
## %+%, alpha
## Warning in ggcorr(WhiteWines, geom = "blank", label = TRUE, hjust = 1):
## data in column(s) 'QualityCategory' are not numeric and were ignored
Analyzing the correlation map, the most important relationships are:
Residual sugar and density have a correlation of 0.84.
We got an almost linear relationship between residual sugar and density, which shows that as density increases, residual sugar also increases.
We can see that as alcohol increases, density decreases (correlation of -0.78). Density ranges from about 0.98 to 1.04 and alcohol from 8 to 14.
We also see that as density increases, total sulfur dioxide tends to increase as well (correlation of 0.53).
In this part we will create plots that involve the quality of the wines.
Let’s take a closer look at the relation between alcohol and density with quality.
## $`3`
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.00 9.55 10.45 10.34 11.00 12.60
##
## $`4`
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.40 9.40 10.10 10.15 10.75 13.50
##
## $`5`
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.000 9.200 9.500 9.809 10.300 13.600
##
## $`6`
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.50 9.60 10.50 10.58 11.40 14.00
##
## $`7`
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.60 10.60 11.40 11.37 12.30 14.20
##
## $`8`
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.50 11.00 12.00 11.64 12.60 14.00
##
## $`9`
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 10.40 12.40 12.50 12.18 12.70 12.90
We see that as the alcohol in the wines increases, quality tends to increase as well, and wines with an alcohol percentage above 10.8 tend to have better quality. Now let’s take a look at density.
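Per-grade summaries like the ones above can be produced with `tapply()`, which applies a function to each group. A minimal sketch on made-up values (not the wine data); the names `toy`, `by_quality` and `medians` are only for this illustration.

```r
# Made-up alcohol values and grades, for illustration only.
toy <- data.frame(
  alcohol = c(9.2, 9.8, 10.2, 10.8, 11.0, 12.0),
  quality = factor(c(5, 5, 6, 6, 7, 7))
)

# One summary() per quality level; the result is indexed by level name.
by_quality <- tapply(toy$alcohol, toy$quality, summary)
by_quality

# Just the medians, as a named vector.
medians <- tapply(toy$alcohol, toy$quality, median)
medians
```

Comparing medians rather than means makes the per-grade comparison less sensitive to a few extreme wines.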
Density is another feature that may be correlated with quality. Let’s analyse that correlation.
## $`3`
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9911 0.9925 0.9944 0.9949 0.9969 1.0000
##
## $`4`
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9892 0.9926 0.9941 0.9943 0.9958 1.0000
##
## $`5`
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9872 0.9933 0.9953 0.9953 0.9972 1.0020
##
## $`6`
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9876 0.9917 0.9937 0.9940 0.9959 1.0390
##
## $`7`
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9871 0.9906 0.9918 0.9925 0.9937 1.0000
##
## $`8`
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9871 0.9903 0.9916 0.9922 0.9935 1.0010
##
## $`9`
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9896 0.9898 0.9903 0.9915 0.9906 0.9970
It seems that lower density tends to go with better quality: as density approaches 0.99, quality tends to increase.
According to the correlation matrix, chlorides are also correlated with quality.
## $`3`
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.02200 0.03625 0.04100 0.05430 0.05400 0.24400
##
## $`4`
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0130 0.0380 0.0460 0.0501 0.0540 0.2900
##
## $`5`
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00900 0.04000 0.04700 0.05155 0.05300 0.34600
##
## $`6`
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.01500 0.03600 0.04300 0.04522 0.04900 0.25500
##
## $`7`
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.01200 0.03100 0.03700 0.03819 0.04400 0.13500
##
## $`8`
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.01400 0.03000 0.03600 0.03831 0.04400 0.12100
##
## $`9`
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0180 0.0210 0.0310 0.0274 0.0320 0.0350
As with density, wines with lower chlorides tend to have better quality: the median falls from 0.047 at quality 5 to 0.031 at quality 9.
I noticed some relationships with quality: wines with higher alcohol, lower density and lower chlorides tend to have better quality.
I noticed some relationships in the dataset. Density increases with residual sugar and total sulfur dioxide and decreases as alcohol increases.
As density increases, so do residual sugar and sulfur dioxide, while alcohol decreases.
## Warning: Removed 76 rows containing non-finite values (stat_smooth).
## Warning: Removed 86 rows containing missing values (geom_point).
In this plot you can see that as quality increases, there are more wines with higher alcohol levels and lower densities.
As with density, as chlorides decrease there are more wines with higher alcohol levels and lower densities.
The features most strongly related are alcohol and quality; quality is inversely related to chlorides.
Category plot
The wines were graded on a scale from 0 to 10; the lowest grade given was 3 and the highest was 9.
The distribution is also approximately normal in shape, with its peak at 6.
Category and Alcohol plot
There is a clear positive relationship between the quality of a wine and its alcohol content: as the alcohol percentage increases, quality tends to increase too. The relationship is not perfect, however. Take quality level 5, for example, where the mean alcohol (9.81) is lower than at quality level 4 (10.15).
Category and Alcohol and Density plot
## Warning: Removed 76 rows containing non-finite values (stat_smooth).
## Warning: Removed 87 rows containing missing values (geom_point).
In this graph we compare two variables correlated with the quality of the wine, density and alcohol. In this plot you can see that as quality increases, there are more wines with higher alcohol levels and lower densities.
The dataset contains observations of 4898 wines and 13 features. Our goal in this project was to investigate the wine dataset to find the features that make a good wine. We found that some features are more strongly related to the quality of wine than others: alcohol, density and chlorides. Quality is positively related to alcohol and inversely related to density and chlorides. That means that as the quality grade increases, the alcohol level also tends to increase, while density and chlorides tend to decrease. The visualizations made it simpler to get a sense of how those features are related to each other and to wine quality. I hope to have succeeded in the visualization and analysis of this dataset. One limitation, and a suggestion for improving the dataset, would be to take into account wines from different parts of the world and features such as the soil and climate of the winery, so that we can remove some bias and learn more about what makes a great wine.